Abstract

This paper presents a practical, theoretically well-founded approach to improving the
speed of kernel manifold learning algorithms that rely on spectral decomposition. Utilizing recent
insights in kernel smoothing and learning with integral operators, we propose Reduced Set
KPCA (RSKPCA), which also suggests an easy-to-implement method to remove or replace
samples with minimal effect on the empirical operator. A simple data point selection procedure
is given to generate a substitute density for the data, with accuracy that is governed by a
user-tunable parameter $\ell$. The effect of the approximation on the quality of the KPCA solution, in
terms of spectral and operator errors, can be shown directly in terms of the density estimate
error and as a function of the parameter $\ell$. We show in experiments that RSKPCA can improve
both training and evaluation time of KPCA by up to an order of magnitude, and that it compares
favorably to the widely used Nyström and density-weighted Nyström methods.


1 Introduction 

Modern problems in machine learning are characterized by large, often redundant, high-dimensional
datasets. To interpret and more effectively use high-dimensional data, a simplifying assumption
often made is that the data lies on an embedded manifold. Recovery of the underlying manifold
aids certain machine learning problems such as deriving a classifier from the data, or estimating
a function of interest. Algorithms within the field of manifold learning that try to recover this
underlying structure include Laplacian eigenmaps [3] and diffusion maps [6]. Many
such methods can be thought of as Kernel PCA (KPCA) [12] performed on specially constructed
kernel matrices jS]. We denote this class of methods as Kernel Manifold Learning Algorithms (KMLAs).
For a dataset with $n$ points, KMLAs involve the eigendecomposition of an $n \times n$ kernel matrix $K$,
and a manifold mapping of order $O(n)$ in cost, which limits their
usefulness in some application domains (e.g., online learning and visual tracking). In addition to
the computational cost, storage of the kernel matrix in memory becomes difficult for larger datasets,
particularly for kernels such as the Gaussian, which tend to generate dense matrices. Therefore
a truly scalable KMLA method should be one that 1) avoids the computation of the full kernel
matrix, 2) has low training cost, and 3) has low testing cost.

Existing methods for speeding up the computation time of KMLAs focus on the training and
testing phases separately. Examples of the former include methods such as Incomplete Cholesky
Decomposition (ICD) [13], the Nyström method [7] and random projections p.], which compute a low
rank approximation of the kernel matrix in terms of the original dataset with $n$ points and a subset
of $m$ points (see [20] and the references therein). While exhibiting excellent performance, ICD,
random projections and certain Nyström methods require the computation of the kernel matrix.
An example of a Nyström method that does not require the computation of a kernel matrix is
one where the centers are chosen uniformly from the data. While performing well in practice, it
suffers from the lack of a principled way to choose the number of centers. Related work in the
class is [20], which employs k-means clustering and a density-weighted Gram matrix for performing
KPCA. Drawbacks to the approach include the use of k-means, which also requires the number
of clusters in advance and can be slow in high dimensions (due to its iterative nature), and an
asymmetric weighted Gram matrix. Further, both methods require the retention of the full dataset
for computing projections; while the training cost may be lower, the testing cost remains the same.

Methods to reduce the testing cost include reduced set selection and sparse selection methods,
which find a reduced set of expansion vectors from the original space that approximate the
training set well [HI US]; reduced set construction, which identifies new elements of the input space that
approximate the training set well m,; and kernel map compression, which uses generalized radial
basis function networks to approximate the kernel map [2]. Given that the full eigendecomposition
is typically required, these methods tend to be expensive in training, but can reduce the testing
cost significantly.

Approach. To the authors’ knowledge, no method exists which considers speeding up both
training and testing of KMLAs in a unified and principled manner. This paper proposes to do so
by connecting kernel principal component analysis to the eigendecomposition of kernel smoothing
operators. In particular, given a sampled data set $\{x_i\}_{i=1}^n$, we show that the spectral decomposition
of the Gram matrix $K$ is related to the kernel density estimate $\hat{p}(x)$. If an approximation $\tilde{p}(x)$
is available whose cardinality is much lower than that of $\hat{p}(x)$, an approximation to the original
Gram matrix can be computed at a significantly reduced computational cost, thus improving the
execution of KMLAs.

Contribution. There are two main contributions in the paper. This paper first exploits the
connection of kernel smoothing to the spectral decomposition of integral operators, within the
context of kernel principal component analysis (KPCA), to define reduced set KPCA (RSKPCA).
RSKPCA relies on the existence of a reduced set density estimate (RSDE) of the dataset, with a
cardinality of $m$ rather than $n$ (where $m \ll n$). The RSDE defines a weighted $m \times m$ Gram matrix
$\tilde{K}$, whose eigendecomposition is computed in lieu of the empirical Gram matrix $K$. The RSKPCA
approach circumvents the computation of the full kernel matrix so that the eigendecomposition is
of order $O(m^3)$ cost instead of $O(n^3)$. Evaluation time is also reduced, as mapping a test point into
the reduced eigenspace requires $O(km)$ operations rather than $O(kn)$, with $k$ retained eigenvectors.

While many methods can be used to generate the reduced set approximation $\tilde{p}(x)$ to the empirical
density $\hat{p}(x)$, efficient methods are preferred in order to truly impact the overall training time.
This paper proposes a simple, fast, single-pass method relying on the concept of the ‘shadow’ of a
radially-symmetric kernel to generate the approximation $\tilde{p}(x)$, called the shadow density estimate
(ShDE). The ShDE depends on a user-tuned parameter $\ell$ to arrive at an RSDE of cardinality
$m \ll n$ with a run-time cost of $O(mn)$. Unlike previous work where $m$ is chosen arbitrarily, $\ell$ is
related to the kernel, and can generally be set to a generic value (say $\ell = 4$) for a wide variety of
problems.

The shadow algorithm enables the derivation of closed form error bounds of the RSDE and
RSKPCA results. Results bounding (1) the approximation of the density via the Maximum Mean
Discrepancy (MMD), (2) the eigenvalue difference between the operators $K$ and $\tilde{K}$, and (3) the
difference in Hilbert-Schmidt norm between the operators and their eigenspace projections, provide
further theoretical justification for the approach. The bounds are given in terms of the user-tuned
parameter $\ell$. The latter two bounds are shown to be directly related to the first bound,
indicating the importance of the density estimate in generating a correct eigendecomposition. The
proposed approach performs well as a substitute for the Nyström family of algorithms. While the
application of choice in this paper is KMLAs, the method is applicable to any problem which satisfies
the assumptions and which can be formulated as a kernel eigenvalue problem.

Organization. Section 2 reviews the operator view of KPCA. Theoretical support for reduced
set KPCA (RSKPCA) follows in Section 3, which uses the connection to kernel smoothing
to define RSKPCA. Section 4 defines the shadow of the kernel, from which the shadow density
estimate (ShDE) is derived and used in the RSKPCA algorithm. Section 5 provides error bounds
on the MMD distance between the KDE and the ShDE, and on the approximation of the operator
by RSKPCA. Section 6 reports experimental results, which show the efficacy of the method in
speeding up KPCA and KPCA-based methods.


2 KPCA and Eigenfunction Learning 


This section briefly summarizes the foundations of KPCA as regards the spectral decomposition of
operators. To start, let $k : D \times D \to \mathbb{R}$ be a bounded, positive-definite kernel function, defined on
the domain $D \subset \mathbb{R}^d$. Then $k$ has the property $k(x,y) = \langle \psi(x), \psi(y) \rangle_{\mathcal{H}}$, where $\mathcal{H}$ is a Reproducing
Kernel Hilbert Space and $\psi : D \to \mathcal{H}$ is an implicit mapping. The kernel induces a linear operator
$\mathcal{K} : \mathcal{L}^2(D) \to \mathcal{L}^2(D)$,

$$(\mathcal{K} f)(x) := \int_D k(x,y)\, f(y)\, dy. \tag{1}$$

To incorporate data arising from a probability density $p(x)$, (1) can be modified. Let $\mu$ be a
probability measure on $D$ associated to $p$, and denote by $\mathcal{L}^2(D,\mu)$ the space of square integrable
functions with norm $\|f\|_p^2 = \langle f, f \rangle_p = \int_D f(x)^2\, d\mu(x)$. Define the linear operator
$\mathcal{K} : \mathcal{L}^2(D,\mu) \to \mathcal{L}^2(D,\mu)$ by

$$(\mathcal{K} f)(x) := \int_D k(x,y)\, f(y)\, p(y)\, dy. \tag{2}$$

The operator $\mathcal{K}$ is associated to the eigenproblem

$$\int_D k(x,y)\, p(x)\, \phi_\iota(x)\, dx = \lambda_\iota\, \phi_\iota(y), \tag{3}$$

where $\phi_\iota$ are the eigenfunctions. In practice, given a sample set $X = \{x_i\}_{i=1}^n$ drawn from $p(x)$, the
empirical approximation to (3) is derived from the approximation

$$\int_D k(x,y)\, p(x)\, \phi_\iota(x)\, dx \approx \frac{1}{n} \sum_{i=1}^{n} k(x_i, y)\, \phi_\iota(x_i), \tag{4}$$

as obtained from the empirical estimate of the probability density $p(x)$ using $X$,

$$p(x) \approx \frac{1}{n} \sum_{i=1}^{n} \delta(x_i, x), \tag{5}$$


which employs the sampling property of the delta function. Equation (4) then leads to the
eigendecomposition of the Gram matrix $K$,

$$K \phi_\iota = \lambda_\iota \phi_\iota, \qquad K_{ij} := k(x_i, x_j), \tag{6}$$

for $x_i, x_j \in X$, where $(\lambda_\iota, \phi_\iota)$ are the eigenvalue and eigenvector pairs of $K$ in the finite-dimensional
subspace generated by the mapped data points, $x_i \mapsto k(x_i, \cdot)$. Kernel principal component analysis
(KPCA) further scales the eigenvectors of $K$ by their eigenvalues to achieve orthonormality. As the
number of samples $n \to \infty$, the approximation converges to the true eigenvalues and eigenfunctions
of (3) HSlij.
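
As an illustration of the empirical eigenproblem (6), the following is a minimal NumPy sketch rather than the authors' implementation. The function names (`gaussian_kernel`, `kpca_fit`, `kpca_embed`) are hypothetical, the Gaussian form $k(a,b) = \exp(-\|a-b\|^2/\sigma^2)$ is an assumption, and the final eigenvalue rescaling that KPCA applies for orthonormality is omitted. New points are mapped with the extension suggested by (4), $\phi_\iota(y) \approx \frac{1}{n\lambda_\iota}\sum_i k(x_i,y)\,\phi_\iota(x_i)$.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Assumed Gaussian form: k(a, b) = exp(-||a - b||^2 / sigma^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

def kpca_fit(X, sigma, r):
    """Top-r eigenpairs of the normalized Gram matrix K / n, cf. Eq. (6)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    evals, evecs = np.linalg.eigh(K / n)        # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:r]         # keep the r largest
    return evals[order], evecs[:, order]

def kpca_embed(X, evals, evecs, sigma, Y):
    """Extend the eigenvectors to new points Y; costs O(n) kernel evaluations per point."""
    n = X.shape[0]
    return gaussian_kernel(Y, X, sigma) @ evecs / (n * evals)
```

Evaluating `kpca_embed` on the training points returns the eigenvectors themselves, which is a quick sanity check; the per-point cost that grows with $n$ is the testing cost the reduced set approach aims to lower.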

3 Reduced Set KPCA 

This section proposes an alternative formulation of the operator and its spectral decomposition in
order to derive reduced set KPCA, as based on an approximation to the empirically determined
kernel density estimate. First, note that the integral equation leading to KPCA, Eq. (2), implies
a kernel smoothing of the density (using the operator $\mathcal{K}$ applied to $p$),

$$(\mathcal{K} p)(x) = \int_D k(x,y)\, p(y)\, dy. \tag{7}$$


Given a set of samples $X = \{x_1, \dots, x_n\}$ drawn from the density $p$ and using (5), the smoothed
approximation (7) is obtained as

$$\hat{p}(x) = (\mathcal{K} p)(x) \approx \frac{1}{n} \sum_{i=1}^{n} k(x_i, x), \tag{8}$$


which is known as the kernel density estimate (KDE) [17]. The KDE converges to $p(x)$ under
some mild assumptions. However, using it can be expensive due to the $O(n)$ operations required to
compute $\hat{p}(x)$; thus it is common to utilize a reduced set density estimate

$$\tilde{p}(x) = \sum_{i=1}^{m} w_i\, k(c_i, x), \tag{9}$$


where $W = \{w_1, \dots, w_m\}$, $C = \{c_1, \dots, c_m\}$, and $m \ll n$. The empirical density generating $\tilde{p}$
under the kernel smoother $\mathcal{K}$ is

$$p(x) \approx \sum_{i=1}^{m} w_i\, \delta(c_i, x). \tag{10}$$

While having quite different generating approximations, the kernel smoothed density $\tilde{p}$ is close to $\hat{p}$
by construction [5][E, 20]. This paper will replace the KPCA procedure of the eigenproblem derived
from (4) and (5) with one derived from (9) and (10) using an alternative, equivalent formulation
of the continuous eigenproblem (3). The formulation considers the kernel

$$\tilde{k}(x,y) = p^{1/2}(x)\, k(x,y)\, p^{1/2}(y), \tag{11}$$

which is a density weighted version of the original kernel. The eigenvalues of (2) are the same as
those of (11) m-. Therefore, the eigenproblem of (3) is the same as the eigenproblem

$$\int_D \tilde{k}(x,y)\, \tilde{\phi}_\iota(x)\, dx = \lambda_\iota\, \tilde{\phi}_\iota(y), \tag{12}$$

where the relationship between the two eigenvector sets is that $\tilde{\phi}_\iota(\cdot) = p^{1/2}(\cdot)\, \phi_\iota(\cdot)$. Using (10) and
(11) in (12) gives an eigendecomposition problem with the reduced set Gram matrix

$$\tilde{K} \tilde{\phi}_\iota = \lambda_\iota \tilde{\phi}_\iota, \qquad \tilde{K}_{ij} := \sqrt{w_i}\, k(c_i, c_j)\, \sqrt{w_j}, \tag{13}$$
Algorithm 1 Reduced Set KPCA

  Apply a reduced set density estimator to $X$ to compute
    $C = \{c_1, \dots, c_m\}$ and $w = \{w_1, \dots, w_m\}$.
  Create diagonal matrix $W = \mathrm{diag}(\sqrt{w_1}, \dots, \sqrt{w_m})$.
  Compute weighted kernel matrix
    $\tilde{K} \in \mathbb{R}^{m \times m}$, $\tilde{K} := W K^C W$,
    where $K^C_{ij} := k(c_i, c_j)$.
  Perform eigenvector decomposition $\tilde{K} \tilde{\phi}_\iota = \lambda_\iota \tilde{\phi}_\iota$.
  Reweight to get the eigenvectors $\phi_\iota = W^{-1} \tilde{\phi}_\iota$.


for $c_i, c_j \in C$. The proposed reduced set KPCA procedure replaces the Gram matrix $K$ in the
empirical eigenproblem (6) by a density weighted surrogate

$$\tilde{K} = W K^C W^T,$$

where $K^C_{ij} := k(c_i, c_j)$ and $W = \mathrm{diag}(\sqrt{w_1}, \dots, \sqrt{w_m})$ is the weight matrix. The matrix $\tilde{K}$ is an
empirical, finite-dimensional approximation to (11). Unlike $K$, $K^C$ is an $m \times m$ matrix (as is $\tilde{K}$).


Once the centers are selected and the weights computed using a reduced set density estimation
algorithm, the original data is discarded. This makes the algorithm fundamentally different from
Nyström type methods, which retain the training data for eigenfunction computations at test time,
and from both the sparse approximation and the eigenvector approximation methods, which need to first
compute the eigendecomposition of a full kernel matrix to generate the reduced set eigenfunction
computations for testing. The algorithm can be more aggressive with the training data than either
of these two strategies in pursuit of both training and testing speedups. The reduced set KPCA
algorithm is summarized in Algorithm 1. Since the full kernel matrix is never computed once an
RSDE is available, the training cost of the algorithm is $O(m^3)$ and the testing cost is $O(m)$.

The key insight into the procedure is that an accurate reduced set density estimate must lead
to a similarly accurate reduced set KPCA. This is seen by noting that the KDE and the RSDE
both arise as empirical approximations to the same continuous eigenproblem.
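
The reduced set procedure can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the names `rskpca_fit` and `rskpca_embed` are hypothetical, the Gaussian form of the kernel is assumed, the weights are normalized to sum to one (an assumption that puts the eigenvalues on the same scale as those of $K/n$), and the reweighting step is taken to be $\phi_\iota = W^{-1}\tilde{\phi}_\iota$, following the relation $\tilde{\phi}_\iota = p^{1/2}\phi_\iota$.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Assumed Gaussian form: k(a, b) = exp(-||a - b||^2 / sigma^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

def rskpca_fit(C, w, sigma, r):
    """Eigendecompose the m x m surrogate W K^C W built from centers C and weights w."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                              # assumption: weights sum to one
    sw = np.sqrt(w)                              # diagonal of W
    Kc = gaussian_kernel(C, C, sigma)
    K_tilde = sw[:, None] * Kc * sw[None, :]     # W K^C W without forming diag(W)
    evals, evecs = np.linalg.eigh(K_tilde)
    order = np.argsort(evals)[::-1][:r]
    lam, phi_tilde = evals[order], evecs[:, order]
    phi = phi_tilde / sw[:, None]                # assumed reweighting W^{-1} phi_tilde
    return lam, phi, w

def rskpca_embed(C, w, lam, phi, sigma, Y):
    """Map new points using only the m centers: O(m) kernel evaluations per point."""
    return gaussian_kernel(Y, C, sigma) @ (w[:, None] * phi) / lam
```

With every sample used as its own center and unit weights, `rskpca_fit` reduces to ordinary KPCA on $K/n$, which gives a simple consistency check; only `C`, `w`, `lam` and `phi` need to be stored after training.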

Extension to KMLAs. More generally, there is a class of manifold learning methods that
can be reformulated as the following generic eigenproblem

$$(\mathcal{G} f)(x) = \int_D g(x,y)\, k(x,y)\, f(y)\, p(y)\, dy. \tag{14}$$

If $\mathcal{G}$ is a positive definite operator, it generates an RKHS $\mathcal{H}$. An equivalent eigenproblem is of the
form

$$(\tilde{\mathcal{G}} f)(x) = \int_D g(x,y)\, f(y)\, \tilde{k}(x,y)\, dy. \tag{15}$$

Given algorithms where the integral operator is of the form (15) (such as diffusion maps, Laplacian
eigenmaps, normalized cut, etc.), approximation algorithms similar to Algorithm 1 can be formulated.


4 A Fast and Simple RSDE 

This section gives a specific RSDE algorithm for use within RSKPCA, to improve the execution time of
learning and testing versus KPCA. Because the proposed algorithm is simple, closed form approximation
errors are computable, as explored in the subsequent section.
While many algorithms have been designed for reduced set density estimation, to meet our
purposes, the RSDE must satisfy three criteria: 1) it must incorporate the kernel within its estimate;
2) its computational cost cannot be excessive, as that would fail to speed up the KMLA; and 3)
the number of centers $m$ must be identified in a principled way, since it may vary from problem
to problem, and the estimate must have deterministic approximation error. These three criteria are met by a
simple algorithm exploiting the structure of radially symmetric kernels. An approach similar to the
one proposed here is found in [16]; however, its selection parameter is not fundamentally related
to the kernel bandwidth and no connection to KPCA is drawn.

Given a bounded kernel function $k(\cdot,\cdot)$, where $\kappa$ is the maximum value attained at $k(c,c)$, $\forall c \in \mathbb{R}^d$,
and a sequence $\{y_i\}_{i \in \mathbb{N}}$, if $\|c - y_i\| \to 0$, then $k(c, y_i) \to \kappa$ (as $i \to \infty$). Points sufficiently close
to $c$ seem indistinguishable from the perspective of the kernel centered at $c$. Declare such points
near $c$ to lie in the shadow of the kernel function at $c$. Given a dataset $\{x_i\}_{i=1}^n$ used to determine $\hat{p}(x)$,
all points of the dataset in the shadow of another point $c \in \{x_i\}_{i=1}^n$ can be replaced with $c$ at minor
cost. Removing the now duplicate points requires increasing the weight of $c$ in the KDE by the number of
points removed. Extending this idea further, suppose that there existed a collection
of points from $\{x_i\}_{i=1}^n$ whose $\epsilon$-balls covered the entire dataset (with $\epsilon$ to be defined shortly); then
points lying in these $\epsilon$-balls could be removed with minor effect, leading to the shadow density
estimate:

$$\tilde{p}(x) := \frac{1}{n} \sum_{j=1}^{m} w_j\, k(c_j, x) \approx \frac{1}{n} \sum_{j=1}^{m} \sum_{\xi \in S_j} k(\xi, x), \tag{16}$$


where $S_j$ is the set of points lying in the shadow of the point $c_j$, $w_j = |S_j|$, and $S_i \cap S_j = \emptyset$
when $i \neq j$. This paper specializes to the case of radially symmetric kernels with bandwidth
parameter $\sigma$, and defines $\epsilon$ to be determined by a parameter $\ell$ via $\epsilon(\ell) = \sigma/\ell$. What remains
is to provide a selection procedure for the shadow centers $c_j$. Algorithm 2 provides a single-pass
$O(mn)$ complexity approach.² Figure 1 conceptually depicts the process of moving from data to
shadow centers, and also the reconstruction of the KDE using a ShKDE. The color coding depicts
the distinct shadow sets. Based on Algorithm 2, the RSKPCA procedure follows as in Algorithm 1. The
next section utilizes $\epsilon(\ell)$ to analyze the effectiveness of the ShDE approximation and the fidelity
of RSKPCA. The experiments section discusses other RSDEs, and compares ShDE to them in the
context of RSKPCA.
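
The single-pass shadow selection can be sketched as follows. This is a minimal NumPy sketch, assuming Euclidean distance and the radius $\epsilon = \sigma/\ell$; the function name `shadow_select` and the returned assignment array are hypothetical additions for illustration.

```python
import numpy as np

def shadow_select(X, sigma, ell):
    """Greedy single-pass shadow selection with radius eps = sigma / ell.

    Returns the centers C (m x d), the weights w_j = |S_j|, and for each
    sample the index of the center whose shadow it falls in.
    """
    eps = sigma / ell
    remaining = np.arange(X.shape[0])            # indices not yet assigned
    center_idx, weights = [], []
    assign = np.empty(X.shape[0], dtype=int)
    while remaining.size > 0:
        c = remaining[0]                         # first remaining element becomes a center
        dist = np.linalg.norm(X[remaining] - X[c], axis=1)
        in_shadow = dist < eps                   # the center itself has distance 0
        assign[remaining[in_shadow]] = len(center_idx)
        center_idx.append(c)
        weights.append(int(in_shadow.sum()))
        remaining = remaining[~in_shadow]        # X = X \ S
    return X[center_idx], np.array(weights), assign
```

Each pass over the remaining points costs at most $n$ distance evaluations and one pass is made per center, which matches the $O(mn)$ cost stated above. The number of centers $m$ is an output of the procedure, determined by $\ell$ and the data, rather than an input.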


5 Analysis of Approximation Error 


This section derives bounds on the MMD error for shadow densities, plus bounds on the difference
between the eigenvalues and spectral projections of the operators associated to the original kernel
matrix generated by KPCA, $K$, and the one generated by the shadow density, $\tilde{K}$. The bounds
support the claim that an accurate RSDE leads to an accurate eigendecomposition, since the
bounds on the approximation error of the eigendecomposition are given in terms of the error of the
approximated density estimate.

Consider a set of points $X = \{x_1, \dots, x_n\}$ sampled from the distribution $p$. Let the shadow
centers be given by $C = \{c_1, \dots, c_m\}$, and define the data-to-center mapping $\alpha : \{1, \dots, n\} \to
\{1, \dots, m\}$. The shadow quantized dataset generated from $X$ is given by $\bar{C} = \{c_{\alpha(1)}, \dots, c_{\alpha(n)}\}$.

Here, as in [20], kernels that satisfy the following inequality are considered,

$$\big(k(a,b) - k(c,d)\big)^2 \le C_X^k \big(\|a - c\|^2 + \|b - d\|^2\big), \tag{18}$$


² The parameter $\ell$ implicitly determines the number $m$.

Algorithm 2 Shadow Set Selection Procedure

  Input: $X = \{x_i\}_{i=1}^n$; bandwidth $\sigma$, and $\ell \in \mathbb{R}_+$.
  Set $C = \emptyset$, $W = \emptyset$, $m = 0$, and

      $\epsilon = \sigma/\ell$.                                                    (17)

  while $X \neq \emptyset$ do
    Let $c$ be the first element of $X$.
    Find shadow set $S = \{y \in X : \|y - c\| < \epsilon\}$.
    Update center set $C = C \cup \{c\}$.
    Update weight set $W = W \cup \{|S|\}$.
    Set $X = X \setminus S$.
  end while




Figure 1: Visualization of the data, the shadow centers, and the associated KDE and ShKDE. 


where $C_X^k$ is a constant depending on $k$ and the sample set $X$, and that can be written as

$$k(x,y) = \varphi\!\left(\frac{\|x - y\|^p}{\sigma^p}\right). \tag{19}$$

The Laplacian and Gaussian, in particular, satisfy (18) and (19) for $\varphi(s) = e^{-s}$. The constant
$C_X^k$ for the Laplacian and for the Gaussian is given in [19].


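The maximum mean discrepancy (MMD) is a distance measure between probability distributions
in the Hilbert space $\mathcal{H}$ induced by the kernel $k$ [13]. The (biased) MMD is defined to be

$$\mathrm{MMD}(X, Y)_b := \left\| \frac{1}{n} \sum_{i=1}^{n} \psi(x_i) - \frac{1}{n} \sum_{i=1}^{n} \psi(y_i) \right\|_{\mathcal{H}}, \tag{20}$$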


where the $b$ denotes bias and $\psi$ is the mapping from the input space $\mathbb{R}^d$ to $\mathcal{H}$, $\psi(x) := k(x, \cdot)$. The
points $x_i \in X$ and $y_i \in Y$ are generated by probability distributions $p$ and $q$ respectively; both
sets have the same number of elements. The MMD can be thought of as the squared $L^2$ distance
between two KDEs of the form (8) (up to scaling factors induced by $\mathcal{H}$) [14]. Since the kernel in
KPCA induces a smoothing effect on the samples from the true probability density $p$, a small value
for the MMD between the KDE and an RSDE is indicative of the RSDE acting as an effective
surrogate for $p$ in the KPCA space, thus generating an effective approximation to (3) via the use
of Algorithm 1.

The theorem below bounds the MMD between the KDE $\hat{p}(x)$ and the ShDE $\tilde{p}(x)$.

Theorem 5.1. (MMD Worst Case Bound) Let $n$ be the number of samples, $X$ be defined as
above, $\bar{C}$ be the quantized dataset, and let $k$ satisfy (19). Then

$$\mathrm{MMD}(X, \bar{C})_b \le \sqrt{2}\, \sqrt{\kappa - \varphi\!\left(\frac{1}{\ell^p}\right)}. \tag{21}$$


Proof. Follows from (19) and (20) through the identity $\sum_{c_i \in C} w_i\, \psi(c_i) = \sum_{x_i \in X} \psi(c_{\alpha(i)})$, which
gives the ShDE and the KDE the same cardinality, $n$. □
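
A numerical check of a worst-case MMD bound of this kind can be sketched as below. The sketch makes several assumptions that should be read as such: a Gaussian kernel $k(a,b) = \exp(-\|a-b\|^2/\sigma^2)$ (so $\kappa = 1$, $\varphi(s) = e^{-s}$, $p = 2$), a greedy quantizer with radius $\sigma/\ell$ standing in for the shadow selection, and the bound written as $\sqrt{2(\kappa - \varphi(1/\ell^p))}$, which follows from $\|\psi(x) - \psi(c)\|_{\mathcal{H}}^2 = 2(\kappa - k(x,c))$ and the triangle inequality. The helper names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Assumed Gaussian form: k(a, b) = exp(-||a - b||^2 / sigma^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

def mmd_biased(X, Y, sigma):
    """Biased MMD of Eq. (20), expanded in terms of kernel evaluations."""
    val = (gaussian_kernel(X, X, sigma).mean()
           + gaussian_kernel(Y, Y, sigma).mean()
           - 2.0 * gaussian_kernel(X, Y, sigma).mean())
    return float(np.sqrt(max(val, 0.0)))

def quantize(X, eps):
    """Replace every point by a center lying closer than eps (greedy single pass)."""
    Q = np.empty_like(X)
    todo = np.arange(X.shape[0])
    while todo.size:
        c = X[todo[0]]
        near = np.linalg.norm(X[todo] - c, axis=1) < eps
        Q[todo[near]] = c
        todo = todo[~near]
    return Q

def mmd_bound(ell, kappa=1.0, p=2):
    """Worst-case bound sqrt(2 (kappa - phi(1 / ell^p))) with phi(s) = exp(-s)."""
    return float(np.sqrt(2.0 * (kappa - np.exp(-1.0 / ell**p))))
```

For example, `mmd_biased(X, quantize(X, sigma / ell), sigma)` can be compared against `mmd_bound(ell)`; the bound depends only on $\ell$ and the kernel, not on the data or the weights.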

The ShDE+RSKPCA procedure creates a matrix $\tilde{K}$ that acts as an $m \times m$ surrogate for the
quantized kernel matrix $\bar{K}_{ij} = k(c_{\alpha(i)}, c_{\alpha(j)})$, for $i, j = 1, \dots, n$. Exploiting the quantization effect,
the following theorems bound the eigenvalue difference between the two spectral decompositions
and also the difference between the operators in $\mathcal{H}$ in terms of the Hilbert-Schmidt norm.


Theorem 5.2. Let $k$ be such that (18) holds, and let $\lambda_i$ and $\bar{\lambda}_i$ be the eigenvalues of the normalized
matrices $K$ and $\bar{K}$ respectively. Then

$$\sum_{i=1}^{n} (\lambda_i - \bar{\lambda}_i)^2 \le 2\, C_X^k \left(\frac{\sigma}{\ell}\right)^2.$$

Proof. Follows from the Hoffman-Wielandt inequality and (18). □
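
The Hoffman-Wielandt step used in the proof can be checked numerically. The sketch below is a hedged illustration, not part of the original analysis: the function name `eigen_gap` is hypothetical, and it only verifies the inequality $\sum_i (\lambda_i - \bar{\lambda}_i)^2 \le \|K/n - \bar{K}/n\|_F^2$ for symmetric matrices with eigenvalues sorted in the same order. It does not compute the constant $C_X^k$.

```python
import numpy as np

def eigen_gap(K, K_bar):
    """Return (sum of squared eigenvalue differences, squared Frobenius distance)
    for the normalized symmetric matrices K / n and K_bar / n."""
    n = K.shape[0]
    lam = np.sort(np.linalg.eigvalsh(K / n))[::-1]          # descending order
    lam_bar = np.sort(np.linalg.eigvalsh(K_bar / n))[::-1]
    lhs = float(((lam - lam_bar) ** 2).sum())
    rhs = float(np.linalg.norm((K - K_bar) / n, ord="fro") ** 2)
    return lhs, rhs
```

Passing the Gram matrix of the data as `K` and the Gram matrix of the quantized data as `K_bar` gives the two sides of the first inequality in the proof; applying (18) entrywise to the right-hand side then yields a bound in terms of $\sigma/\ell$.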


Given a kernel function $k$ and $X$, a finite dimensional operator $K_n : \mathcal{H} \to \mathcal{H}$ approximating the
ideal operator $\mathcal{K}$ can be defined via

$$K_n := \frac{1}{n} \sum_{i=1}^{n} \langle \cdot, k_{x_i} \rangle_{\mathcal{H}}\, k_{x_i}, \tag{22}$$

where $x_i \in X$ and $\langle \cdot, k_{x_i} \rangle_{\mathcal{H}}$ projects the point onto the kernel function $k_{x_i} := k(\cdot, x_i) \in \mathcal{H}$ [10]. The
operator can be used to bound the error in Hilbert-Schmidt norm between the empirical operators
generated by KPCA and ShDE+RSKPCA.


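Theorem 5.3. Let $K_n$ and $\tilde{K}_n$ be defined using (22) with $X$ and $\bar{C}$, respectively. Then

$$\| K_n - \tilde{K}_n \|_{HS} \le 2 \sqrt{\kappa}\, \sqrt{2}\, \sqrt{\kappa - \varphi\!\left(\frac{1}{\ell^p}\right)}. \tag{23}$$

Proof. Define the operators $K_n$ and $\tilde{K}_n$ via the extrapolation (22) using $k_{x_i}$ and $k_{c_{\alpha(i)}}$ respectively,
and define the kernel residual in $\mathcal{H}$ to be $e_i := k_{x_i} - k_{c_{\alpha(i)}}$. Then

$$K_n - \tilde{K}_n = \frac{1}{n} \sum_{i=1}^{n} \Big( \langle \cdot, k_{x_i} \rangle_{\mathcal{H}}\, e_i + \langle \cdot, e_i \rangle_{\mathcal{H}}\, k_{c_{\alpha(i)}} \Big),$$

leading to

$$\| K_n - \tilde{K}_n \|_{HS} \le \frac{1}{n} \sum_{i=1}^{n} \big\| \langle \cdot, k_{x_i} \rangle_{\mathcal{H}}\, e_i \big\|_{HS} + \frac{1}{n} \sum_{i=1}^{n} \big\| \langle \cdot, e_i \rangle_{\mathcal{H}}\, k_{c_{\alpha(i)}} \big\|_{HS}.$$

Using the properties of the Hilbert-Schmidt norm, and the maximizer $e'$ such that the centroid
error $\|e_i\|_{\mathcal{H}}$ is largest, the theorem follows. □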


Theorem 5.3 shows that the centroid error in $\mathcal{H}$ is the key to the performance of the learning
algorithm, and that the error is controlled solely in terms of the parameter $\ell$. The independence
of the performance from the weights shows that ShDE effectively learns the percentage of the data
that needs to be retained based on the value of $\ell$, which is dependent on the kernel and not the
data. Finally, $\ell$ controls both the MMD and operator approximations, implying that the density
estimate used in the shadow density procedure is sensible for learning in the eigenspace. Using this
result, the following theorem follows.

















Theorem 5.4. Let $K_n$ and $\tilde{K}_n$ be symmetric positive (finite) Hilbert-Schmidt operators of $\mathcal{H}$ defined
by (22), and assume that $K_n$ has simple nonzero eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_n$. Let $D > 0$ be an
integer such that $\lambda_D > 0$, $\delta_D = \frac{1}{2}(\lambda_D - \lambda_{D+1})$. If $2\sqrt{\kappa}\, \|e'\|_{\mathcal{H}} < \delta_D / 2$, then

$$\| P^D(K_n) - P^D(\tilde{K}_n) \|_{HS} \le \frac{2 \sqrt{\kappa}\, \|e'\|_{\mathcal{H}}}{\delta_D}, \tag{24}$$

where $P^D(A)$ denotes the projection onto the $D$-dimensional eigenspace of $A \in HS(\mathcal{H})$ associated
to the largest eigenvalues.

Proof. Follows from Theorem 3 in [21] and Theorem 5.3. □


6 Experimental Results 

This section demonstrates the effectiveness of RSKPCA on real-world data. Approximation accuracy
tests include eigenembedding and classification tasks with the Gaussian kernel. The datasets
used and the bandwidths chosen (via cross-validation) are given in Table 1. In the figures, $n_t$ refers
to the number of points the model is trained on. All of the comparison algorithms require
specification of the reduced set size $m$. To compare, the shadow method is run with a given $\ell$; the
average of all $m$ achieved on the datasets then determines the value $m$ for the other methods. Table 2
compares the training time and storage size (which relates to evaluation time). All comparisons
are made with KPCA as the baseline. Speedup is relative to the equivalent KPCA execution time.

Eigenembedding comparison with Nyström methods. This experiment demonstrates the
fidelity of the eigenfunctions computed by ShDE+RSKPCA to those generated by KPCA. The
generalization capacity of the approximate eigenfunctions is tested. Using KPCA as the baseline,
ShDE+RSKPCA is compared with three other methods: 1) subsampled KPCA with bases chosen
via random uniform sampling, 2) the regular Nyström method with bases chosen via random
uniform sampling, and 3) the density weighted Nyström (WNyström) method [20]. The experimental
methodology is as follows. First, the KPCA model is trained on the entire dataset. Then,
shadow, uniform, Nyström, and WNyström KPCA models are trained using 80% of the data for
$\ell \in [3.0, 5.0]$, in increments of 0.1. The value $\ell = 3.0$ is chosen as a lower bound for the Gaussian
because lower values of $\ell$ pick points that are no longer similar to the centroid, while $\ell > 5$
generally results in a loss in training efficiency. The KPCA eigenfunction embedding is computed
for the remaining 20% of the data for all the models, with rank $r = 5$. The embeddings are aligned
with each other using the transform $\arg\min_{A \in \mathbb{R}^{r \times r}} \|O - \tilde{O} A\|_F$, where $O$ is the matrix representing
the KPCA embedding, and $\tilde{O}$ represents the approximate KPCA embedding. The Frobenius norm
difference of the embeddings and eigenvalues, the training and testing speedup, and the amount of
data retained are averaged over 50 runs for each $\ell$, and are shown in Figures 2 and 3 for the german
and pendigits datasets. As expected, while subsampled KPCA is faster in the training stage, it
performs worse than any other method, implying that an appropriate weighting is necessary to
approximate the eigenfunctions of KPCA. For larger values of $\ell$, ShDE+RSKPCA always performs
well when it comes to approximating the eigenvalues and eigenfunctions of the operator. In terms
of eigenembedding accuracy, using ANOVA with a value of $\alpha = 0.05$, ShDE+RSKPCA is better
than the Nyström embeddings after $\ell = 3.2$ (3.3) and no worse than the WNyström embeddings
after $\ell = 4.0$ (4.8) for pendigits (german), and asymptotically approaches the KPCA baseline.
While slower than the Nyström method for training, ShDE+RSKPCA is faster than KPCA for
training and achieves significant testing speedups. It does so by retaining a subset of the data via
selection of $\ell$, cf. Fig. 6(a,b).
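
The alignment step $\arg\min_{A} \|O - \tilde{O}A\|_F$ is an unconstrained linear least-squares problem, so one way to compute it is with a standard solver. The sketch below is an assumption about how such an alignment could be implemented, not a description of the authors' code; `align_embedding` is a hypothetical name.

```python
import numpy as np

def align_embedding(O, O_tilde):
    """Solve min_A ||O - O_tilde @ A||_F and return A with the residual norm.

    O is the reference (KPCA) embedding and O_tilde the approximate one,
    both of shape (number of test points, r).
    """
    A, *_ = np.linalg.lstsq(O_tilde, O, rcond=None)   # column-wise least squares
    residual = float(np.linalg.norm(O - O_tilde @ A, ord="fro"))
    return A, residual
```

The residual returned here corresponds to the Frobenius norm difference of the embeddings reported in the figures; the alignment removes sign flips and rotations within the retained eigenspace before the comparison is made.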

Figure 2: Eigenembedding comparison w/Nyström methods for german as $\ell$ is varied ($n_t = 800$).
Panels: (a) Eigenvalue Deviation, (b) Embedding Accuracy, (c) Eigen speedup, (d) Testing speedup.

Figure 3: Eigenembedding comparison w/Nyström for pendigits as $\ell$ is varied ($n_t = 2,800$).
Panels: (a) Eigenvalue Deviation, (b) Embedding Accuracy, (c) Eigen speedup, (d) Testing speedup.


KPCA classification comparison with Nyström methods. This experiment examines the
effectiveness of ShDE+RSKPCA for classification compared with the Nyström methods used
previously. Classification utilizes the k-nn classifier with $k = 3$, using 10-fold cross-validation. The
accuracy, training and testing speedups, and the percentage of data retained are reported. The results are
shown in Figures 4 and 5 for the usps and yale datasets respectively (none = KPCA). For the
k-nn classification case, ShDE+RSKPCA has competitive accuracy with the Nyström methods,
while providing significant training and testing speedups. The training speedup over the Nyström
method in this case arises because the eigenembedding of the data needs to be computed as part of the
k-nn classifier training. Note that the data retained here, Fig. 6(c,d), is less than 10% for $\ell \in [3, 5]$,
implying noticeable speedup in the KPCA step of the classifier (during training and evaluation).

RSKPCA with different RSDE schemes. RSKPCA is performed using alternative RSDEs
to demonstrate the influence of the RSDE algorithm on accuracy, Figs. 7 and 8. First, following [20],
k-means provides a means to generate an RSDE at a time complexity of $O(mn)$ (but tends to
be slow due to being iterative). Second, KDE paring [8] subsamples from the original dataset and
computes the estimate from the reduced set, at an $O(m)$ cost. Third, kernel herding is examined
[5], which provides a mechanism to sample from a KDE using a nonlinear dynamical system. The
samples are shown to be good representative samples. Their generation is $O(n^2 m)$. All of these
algorithms require the user to provide the number $m$. It can be seen that the quality of the RSDE
does influence the accuracy for small $\ell$, less so for larger $\ell$. The center selection schemes that lead
to improved accuracy are costlier than ShDE, thus decreasing training gains. Evaluation speedup
is the same for all methods.

Table 1: Datasets used.

             german    pendigits    usps     yale
  n          1,000     3,500        9,298    5,768
  DIM        24        16           256      520
  CLASSES    2         10           10       10
  k          5         5            15       10
  σ          30        120          18       17



Table 2: Training cost and storage comparison.
[Table body not recoverable from the source; surviving cost entries: O(mn + m^3) (three rows) and O(nr).]

Figure 4: Classification comparison w/Nyström for usps as $\ell$ is varied ($n_t = 8,368$).
Panels: (a) Accuracy, (b) Training speedup, (c) Testing speedup, (d) Total speedup.
Legend: shadows, nystrom, wnystrom.

Figure 5: Classification comparison w/Nyström methods for yale as $\ell$ is varied ($n_t = 5,191$).

Figure 6: Percentage of data retained.


7 Conclusion 

This paper presented (1) a reduced set KPCA algorithm for speeding up KPCA given a reduced
set density estimate of the training data, and (2) a simple, efficient, single-pass algorithm for
generating a suitable RSDE, called the shadow density estimate (ShDE), which relies on a
user-selected parameter $\ell$. The spectral decomposition error was shown to be bounded and directly
related to the bound of the empirical error of the ShDE. Through ShDE+RSKPCA, significant
reductions in both training and evaluation time are achieved with minimal performance loss for
large, redundant datasets. Competitive overall speedups and performance were achieved versus
Nyström methods.

Figure 7: Classification comparisons w/RSDEs for usps as $\ell$ is varied ($n_t = 8,368$).

Figure 8: Classification comparisons w/RSDEs for yale as $\ell$ is varied ($n_t = 5,191$).